
Conversation

@kikisdeliveryservice
Contributor

@kikisdeliveryservice kikisdeliveryservice commented Nov 4, 2019

After a lot of testing: TestFIPS is broken because Day-2 FIPS is broken.
This test will be unnecessary once #1233 merges, and dropping it now unblocks our approved PRs that can't get past the e2e test.

Also kept PDB changes from #1238 - we are well within our testing time, so there's no issue there.

Related-to: #1233

IIRC we did this just to speed up these tests, because updating
workers one by one blew out our hour budget.

The router requires a minimum of two workers though, and we're just
going to be fighting its PDB.

Since customers can't sanely do this, let's stop doing it in our
tests. If our tests take too long, we'll have to either cut down
the tests or move some to a periodic job, etc.
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 4, 2019
@kikisdeliveryservice kikisdeliveryservice changed the title e2e-gcp-op: testing some changes to see if we can pass [WIP] e2e-gcp-op: testing some changes to see if we can pass Nov 4, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2019
@kikisdeliveryservice
Contributor Author

/skip

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2019
@kikisdeliveryservice
Contributor Author

e2e-gcp-upgrade failure reported:
Latest error: Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff., aborted

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

99 failures and the failure I'm looking for ain't one. =(

/retest

@kikisdeliveryservice
Contributor Author

Also testing in my own cluster, just in case this run fails too...

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Nov 5, 2019

OK: locally on my own cluster, the e2e suite passes as-is (minus the maxUnavailable bump and minus the FIPS test).

fyi:
ok  github.com/openshift/machine-config-operator/test/e2e  2432.196s

@kikisdeliveryservice
Contributor Author

So running the tests I saw:

=== RUN   TestMCDToken
--- PASS: TestMCDToken (0.72s)
=== RUN   TestMCDeployed
--- PASS: TestMCDeployed (957.48s)
    mcd_test.go:155: Created add-a-file-b706da87-451c-4707-9c89-cbebb16481ed
    mcd_test.go:115: Pool worker has rendered config add-a-file-b706da87-451c-4707-9c89-cbebb16481ed with rendered-worker-b5c94c0a9a192a9d34a710f2b476ea23 (waited 6.160621494s)
    mcd_test.go:137: Pool worker has completed rendered-worker-b5c94c0a9a192a9d34a710f2b476ea23 (waited 3m48.086067323s)
    mcd_test.go:166: All nodes updated with add-a-file-b706da87-451c-4707-9c89-cbebb16481ed (3m54.496017952s elapsed)
    mcd_test.go:155: Created add-a-file-0db99933-68f8-4dfe-8ed1-027bce275842
    mcd_test.go:115: Pool worker has rendered config add-a-file-0db99933-68f8-4dfe-8ed1-027bce275842 with rendered-worker-16461e13b2afdb84f7d98a13e6aa8296 (waited 4.161636515s)
    mcd_test.go:137: Pool worker has completed rendered-worker-16461e13b2afdb84f7d98a13e6aa8296 (waited 3m56.087283733s)
    mcd_test.go:166: All nodes updated with add-a-file-0db99933-68f8-4dfe-8ed1-027bce275842 (4m0.494787031s elapsed)
    mcd_test.go:155: Created add-a-file-b10f1488-2b85-4f46-8f77-b447eefac695
    mcd_test.go:115: Pool worker has rendered config add-a-file-b10f1488-2b85-4f46-8f77-b447eefac695 with rendered-worker-18893ce3dabce991e3a52421d479b03a (waited 4.162255617s)
    mcd_test.go:137: Pool worker has completed rendered-worker-18893ce3dabce991e3a52421d479b03a (waited 7m58.082184158s)
    mcd_test.go:166: All nodes updated with add-a-file-b10f1488-2b85-4f46-8f77-b447eefac695 (8m2.484391602s elapsed)
=== RUN   TestUpdateSSH
--- PASS: TestUpdateSSH (260.44s)
    mcd_test.go:214: Created sshkeys-worker-92783f18-3a86-42ac-8b89-16a3adfbc19f
    mcd_test.go:115: Pool worker has rendered config sshkeys-worker-92783f18-3a86-42ac-8b89-16a3adfbc19f with rendered-worker-1c3519e89a8e37ec1b3a8e94b7c7bcdc (waited 4.163397747s)
    mcd_test.go:137: Pool worker has completed rendered-worker-1c3519e89a8e37ec1b3a8e94b7c7bcdc (waited 4m12.082127984s)
    mcd_test.go:240: Node ip-10-0-128-101.ec2.internal has SSH key
    mcd_test.go:240: Node ip-10-0-137-196.ec2.internal has SSH key
    mcd_test.go:240: Node ip-10-0-153-186.ec2.internal has SSH key
=== RUN   TestKernelArguments
--- PASS: TestKernelArguments (288.27s)
    mcd_test.go:259: Created kargs-8f59a5dc-4caf-4236-8031-f298775df7e5
    mcd_test.go:115: Pool worker has rendered config kargs-8f59a5dc-4caf-4236-8031-f298775df7e5 with rendered-worker-632afc6a58c46eb8a5ea4054a798c11b (waited 2.164606055s)
    mcd_test.go:137: Pool worker has completed rendered-worker-632afc6a58c46eb8a5ea4054a798c11b (waited 4m42.082828418s)
    mcd_test.go:282: Node ip-10-0-128-101.ec2.internal has expected kargs
    mcd_test.go:282: Node ip-10-0-137-196.ec2.internal has expected kargs
    mcd_test.go:282: Node ip-10-0-153-186.ec2.internal has expected kargs
=== RUN   TestPoolDegradedOnFailToRender
--- PASS: TestPoolDegradedOnFailToRender (16.51s)
=== RUN   TestReconcileAfterBadMC
--- PASS: TestReconcileAfterBadMC (113.02s)
    mcd_test.go:115: Pool worker has rendered config add-a-file-a60631ad-1b50-44dd-829b-a623ff4fd5af with rendered-worker-d18ac8ad948383721248f668564c4767 (waited 4.166286781s)
    mcd_test.go:137: Pool worker has completed rendered-worker-632afc6a58c46eb8a5ea4054a798c11b (waited 1m36.082760528s)
=== RUN   TestDontDeleteRPMFiles
--- PASS: TestDontDeleteRPMFiles (654.84s)
    mcd_test.go:115: Pool worker has rendered config modify-host-file-bf169114-152c-4413-ba2b-62d4ac4f56ad with rendered-worker-6b2386a923650bbf8462be6c2ed3dcfe (waited 2.165140405s)
    mcd_test.go:137: Pool worker has completed rendered-worker-6b2386a923650bbf8462be6c2ed3dcfe (waited 6m46.081142462s)
    mcd_test.go:137: Pool worker has completed rendered-worker-632afc6a58c46eb8a5ea4054a798c11b (waited 4m2.083221173s)
=== RUN   TestCustomPool
--- PASS: TestCustomPool (96.43s)
    mcd_test.go:115: Pool infra has rendered config infra-host-file-acdcad81-ba27-4f16-9b0a-7b26bfc0cb2b with rendered-infra-526be5274d5b02f195710ba832d70198 (waited 6.162455103s)
    mcd_test.go:137: Pool infra has completed rendered-infra-526be5274d5b02f195710ba832d70198 (waited 1m18.082861217s)
    mcd_test.go:553: Node ip-10-0-128-101.ec2.internal has expected infra MC content
    mcd_test.go:137: Pool infra has completed rendered-infra-526be5274d5b02f195710ba832d70198 (waited 2.083914237s)
=== RUN   TestClusterOperatorRelatedObjects
--- PASS: TestClusterOperatorRelatedObjects (0.09s)
=== RUN   TestMastersSchedulable
--- PASS: TestMastersSchedulable (0.79s)
=== RUN   TestOSImageURL
--- PASS: TestOSImageURL (0.88s)
=== RUN   TestOperatorLabel
--- PASS: TestOperatorLabel (0.09s)
PASS
ok  	github.com/openshift/machine-config-operator/test/e2e	2432.196s

The CustomPool test ran for ~90s here; in comparison, in Colin's PR it ran for... 1200s?!

@kikisdeliveryservice
Contributor Author

In Colin's PR, the total test time was 5567.370s:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1238/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/299

TestCustomPool took 1206s there compared to my 90s?!

@cgwalters Something is super weird... I'd obviously expect a shorter total time because I took out the FIPS test, but why is that other test so much faster?

@kikisdeliveryservice
Contributor Author

Why are the tests run in CI so much slower across the board? (seconds, local : CI)

TestUpdateSSH 260 : 650
TestDontDeleteRPMFiles 654 : 914

@kikisdeliveryservice
Contributor Author

I'm going to try running the whole suite in my cluster (with none of our changes)... is GCP just slow?

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Nov 5, 2019

stop bumping max unavailable + dropping FIPS => tests pass

Now trying to run the FIPS test last and bumping the wait-for-pool time: locally it failed while still well under our e2e timeout of 120 minutes (failed runs were ~90), because it hit our 20-minute pool wait instead...

stop bumping max unavailable + bump wait-for-pool timeout by 10 minutes + run FIPS last (when there's no FIPS, TestCustomPool takes 90s, so run it first!) => FIPS still fails

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

So I think I see the problem, and it's nothing to do with our test... changing the order of the tests made it clear. From a preexisting test there is an infra pool, and that gets updated with the FIPS config fine:

I1105 04:33:21.363249 1 render_controller.go:516] Pool infra: now targeting: rendered-infra-97724ba07895acd23c84456efad2a391
But when we move on to the worker pool, we hit problems:

I1105 04:33:21.962602       1 render_controller.go:497] Generated machineconfig rendered-worker-5f52921054984ab957bb61867e67be74 from 11 configs: [{MachineConfig  00-worker  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-container-runtime  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-kubelet  machineconfiguration.openshift.io/v1  } {MachineConfig  99-worker-4585685f-c1ec-4d31-b12c-0b43d78ae76b-registries  machineconfiguration.openshift.io/v1  } {MachineConfig  99-worker-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  add-a-file-23bdb0f6-6db2-4367-a86d-ada8eb8cfa96  machineconfiguration.openshift.io/v1  } {MachineConfig  add-a-file-35c8981a-4338-4ba7-898c-1bdca422f3de  machineconfiguration.openshift.io/v1  } {MachineConfig  add-a-file-4f97785a-56a6-4d24-90c9-ab1e05827e03  machineconfiguration.openshift.io/v1  } {MachineConfig  fips-4c6de4f6-df92-4ef2-9fdf-32c6395abd36  machineconfiguration.openshift.io/v1  } {MachineConfig  kargs-4e5b3272-cd16-4a47-ad34-efd5db2f723e  machineconfiguration.openshift.io/v1  } {MachineConfig  sshkeys-worker-e8a4bac8-6101-4907-aad7-b514bdd2f290  machineconfiguration.openshift.io/v1  }]
E1105 04:33:21.975008       1 render_controller.go:459] Error updating MachineConfigPool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
I1105 04:33:21.975038       1 render_controller.go:376] Error syncing machineconfigpool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
I1105 04:33:21.996842       1 render_controller.go:516] Pool worker: now targeting: rendered-worker-5f52921054984ab957bb61867e67be74
I1105 04:33:26.363653       1 status.go:82] Pool infra: All nodes are updated with rendered-infra-97724ba07895acd23c84456efad2a391
I1105 04:34:00.515866       1 node_controller.go:433] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is now reporting unready: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is reporting OutOfDisk
I1105 04:35:01.233611       1 node_controller.go:433] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is now reporting unready: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is reporting NotReady
I1105 04:35:11.190225       1 node_controller.go:433] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is now reporting unready: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is reporting Unschedulable
I1105 04:35:12.243810       1 node_controller.go:442] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal has completed update to rendered-worker-0ce2ff2fbec6e4a67ab027b498d25b0b
I1105 04:35:12.255982       1 node_controller.go:435] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is now reporting ready
I1105 04:35:16.190594       1 node_controller.go:754] Setting node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal to desired config rendered-worker-5f52921054984ab957bb61867e67be74
I1105 04:35:16.208592       1 node_controller.go:452] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-5f52921054984ab957bb61867e67be74
I1105 04:35:16.665974       1 node_controller.go:452] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal changed machineconfiguration.openshift.io/state = Working
I1105 04:35:16.684721       1 node_controller.go:433] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is now reporting unready: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is reporting Unschedulable
I1105 04:36:45.678195       1 node_controller.go:433] Pool worker: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is now reporting unready: node ci-op--9c5x6-w-b-th288.c.openshift-gce-devel-ci.internal is reporting OutOfDisk
I1105 05:03:23.998721       1 node_controller.go:457] Pool master: node ci-op--9c5x6-m-0.c.openshift-gce-devel-ci.internal changed labels
I1105 05:03:24.008836       1 node_controller.go:457] Pool master: node ci-op--9c5x6-m-1.c.openshift-gce-devel-ci.internal changed labels
I1105 05:03:24.020460       1 node_controller.go:457] Pool master: node ci-op--9c5x6-m-2.c.openshift-gce-devel-ci.internal changed labels
I1105 05:03:25.026321       1 node_controller.go:457] Pool master: node ci-op--9c5x6-m-0.c.openshift-gce-devel-ci.internal changed labels
I1105 05:03:25.035439       1 node_controller.go:457] Pool master: node ci-op--9c5x6-m-1.c.openshift-gce-devel-ci.internal changed labels
I1105 05:03:25.047254       1 node_controller.go:457] Pool master: node ci-op--9c5x6-m-2.c.openshift-gce-devel-ci.internal changed labels
I1105 05:03:49.470515       1 container_runtime_config_controller.go:713] Applied ImageConfig cluster on MachineConfigPool master
I1105 05:03:49.555436       1 container_runtime_config_controller.go:713] Applied ImageConfig cluster on MachineConfigPool worker

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1244/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/309/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-controller-f574fccd-qp7md_machine-config-controller.log

If you look in the worker's journal, it never begins applying rendered-worker-5f52921054984ab957bb61867e67be74; it's not mentioned once!

@kikisdeliveryservice
Contributor Author

It hits this container_runtime_config section and never seems to move on the way other MachineConfigs do. Something is going wrong between rendering the worker config (which the MCO does correctly) and the actual application of the MC.

@kikisdeliveryservice
Contributor Author

Removed all tests but the FIPS test and increased the wait-for-pool timeout to 60 minutes. If this can't pass now, we have big problems.

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

Based on my local testing, TestFIPS will keep failing because something is wrong with Day-2 FIPS; this is not a matter of increasing test duration. I propose we drop TestFIPS to unblock CI (for non-FIPS PRs) and merge #1233, dropping Day-2 FIPS altogether.

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Nov 5, 2019

FTR: the run testing ONLY TestFIPS with a timeout of 60 minutes never proceeds past this point, just like every other run... There's no change we can make to get this test to pass.

I1105 07:24:00.512345       1 render_controller.go:497] Generated machineconfig rendered-worker-0adeb275dc69f8b51a74936af1c80382 from 6 configs: [{MachineConfig  00-worker  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-container-runtime  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-kubelet  machineconfiguration.openshift.io/v1  } {MachineConfig  99-worker-a46d8f5b-440d-4472-90d3-1bee7ebd4e5a-registries  machineconfiguration.openshift.io/v1  } {MachineConfig  99-worker-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  fips-b25c4ef4-ee7d-468b-a70e-e7b194cc54ad  machineconfiguration.openshift.io/v1  }]
I1105 07:24:00.520553       1 render_controller.go:516] Pool worker: now targeting: rendered-worker-0adeb275dc69f8b51a74936af1c80382
I1105 07:24:05.521431       1 node_controller.go:754] Setting node ip-10-0-142-221.ec2.internal to desired config rendered-worker-0adeb275dc69f8b51a74936af1c80382
I1105 07:24:05.535919       1 node_controller.go:452] Pool worker: node ip-10-0-142-221.ec2.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-0adeb275dc69f8b51a74936af1c80382
I1105 07:24:06.549631       1 node_controller.go:452] Pool worker: node ip-10-0-142-221.ec2.internal changed machineconfiguration.openshift.io/state = Working
I1105 07:24:06.567117       1 node_controller.go:433] Pool worker: node ip-10-0-142-221.ec2.internal is now reporting unready: node ip-10-0-142-221.ec2.internal is reporting Unschedulable
I1105 07:25:45.678145       1 node_controller.go:433] Pool worker: node ip-10-0-142-221.ec2.internal is now reporting unready: node ip-10-0-142-221.ec2.internal is reporting OutOfDisk
I1105 07:44:26.092740       1 container_runtime_config_controller.go:713] Applied ImageConfig cluster on MachineConfigPool worker
I1105 07:44:26.125268       1 container_runtime_config_controller.go:713] Applied ImageConfig cluster on MachineConfigPool master

Day 2 FIPS is broken and this test is consistently failing.  Day 2 FIPS
will be dropped in openshift#1233 so this test will be unneeded.

Related-to: openshift#1233
@kikisdeliveryservice kikisdeliveryservice changed the title [WIP] e2e-gcp-op: testing some changes to see if we can pass test/e2e: Drop TestFIPS Nov 5, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2019
@kikisdeliveryservice
Contributor Author

/skip

Member

@ashcrow ashcrow left a comment

Since the old, unsupported implementation here is being replaced, there is no need to test it. 👍

@ashcrow
Member

ashcrow commented Nov 5, 2019

/retest

@cgwalters
Member

/approve
/lgtm
Probably some change to FIPS mode in the node stack broke this; it'd be good to fully root-cause it, but we have higher priorities in fixing the "day 1" install. In the end we aren't going to support "day 2" FIPS, so indeed it doesn't make sense to test it.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, cgwalters, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ashcrow,cgwalters,kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 5b3bfda into openshift:master Nov 5, 2019
@openshift-ci-robot
Contributor

@kikisdeliveryservice: The following test failed, say /retest to rerun them all:

Test name: ci/prow/e2e-aws-scaleup-rhel7
Commit: a686c4a (link)
Rerun command: /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
